Introduction to Open Data Science - Course Project

1: Open Data Science: Introduction

My name is Jonathan, and I am a first-year PhD student at the Department of Forest Sciences at UH, within the doctoral programme in atmospheric sciences. I am developing disturbance modules (windthrow, snow breakage, etc.) for a forest growth model. I have worked a lot with R and RStudio in recent years, and am taking my first steps with Github. I am a big fan of the idea of open research and data (partially, probably, because I depend on them), so this course should be my cup of tea. I would also like to pick up a few best practices for R and Github workflows, as my approach until now has been largely learning-by-doing/trial-and-error, and I am sure I could do things a lot more efficiently and transparently (also for my future self). In addition, the later parts of the course on dimensionality reduction and classification approaches will hopefully help me to make sense of and analyse the large datasets I base the disturbance modelling on.

You can find my version of the IODS repository here: My IODS on Github


2: Exercise 2: Analysis / linear regression

(The data wrangling script can be found here)

1: Reading data & short description

The data was collected in a survey of students’ learning attitudes in an introductory statistics course at the University of Helsinki in 2014 and 2015. Questions have been aggregated into three dimensions/combination categories (deep, surface, and strategic learning), and the students’ answers to the individual questions have been averaged within those categories (Likert scale, 1 = strongly disagree, 5 = strongly agree). Additional background variables for students are age, attitude (‘Global attitude toward statistics’), gender, and exam points (students with an exam score of 0 have been excluded). The data set contains 166 observations/students of the above-mentioned 7 variables.

l14 <- read.csv("data/learning2014.csv")
str(l14)
## 'data.frame':    166 obs. of  7 variables:
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
dim(l14)
## [1] 166   7
head(l14)
##   Age Attitude Points gender     deep     surf  stra
## 1  53       37     25      F 3.583333 2.583333 3.375
## 2  55       31     12      M 2.916667 3.166667 2.750
## 3  49       25     24      F 3.500000 2.250000 3.625
## 4  53       35     10      M 3.500000 2.250000 3.125
## 5  49       37     22      M 3.666667 2.833333 3.625
## 6  38       38     21      F 4.750000 2.416667 3.625

2: Data overview

First a glance at the ranges of the variables (top two lines), and the corresponding means (bottom line):

##      Age Attitude Points     deep     surf stra
## [1,]  17       14      7 1.583333 1.583333 1.25
## [2,]  55       50     33 4.916667 4.333333 5.00
##       Age  Attitude    Points      deep      surf      stra 
## 25.512048 31.427711 22.716867  3.679719  2.787149  3.121235
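A minimal sketch of how these figures could be computed, assuming `l14` as read above:

```r
# numeric columns only (gender is a factor)
num <- sapply(l14, is.numeric)
apply(l14[, num], 2, range)  # top two lines: min and max per variable
colMeans(l14[, num])         # bottom line: means
```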

On average, the statements subsumed under surface learning scored the lowest levels of agreement, while the deep learning statements show the highest levels of agreement, with the strategic learning statements in between.

About twice as many females as males participated in the survey.

## 
##   F   M 
## 110  56

In the following combination plots, the correlations between all variables are depicted:
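The combination plots were presumably produced with GGally’s `ggpairs`, roughly along these lines (the colour mapping by gender is an assumption based on the gender-specific observations below):

```r
library(GGally)
library(ggplot2)

# pairwise scatterplots, distributions and correlations,
# coloured and tested separately by gender
ggpairs(l14, mapping = aes(col = gender, alpha = 0.3),
        lower = list(combo = wrap("facethist", bins = 20)))
```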

I won’t go through all combinations, but limit myself to the most interesting ones.

  • The clearest correlation exists between the variables attitude and points, i.e. a higher score in attitude seems to be significantly positively correlated with the number of points achieved.
  • A high score in deep learning seems to be negatively correlated with the score in surface learning, but this seems to be true only for males.
  • The same is true for the correlation between surface learning and attitude (significant only amongst males)
  • There seem to be relatively more female respondents with a lower score in attitude.

3: Regression model

As the simple linear correlation between the respondents’ learning approach scores (deep/surface/strategic) and the exam score seems to be low, I will have a closer look at the background variables (age, gender, attitude) and their correlation with the exam score here.

lmod1 <- lm(Points ~ Attitude + gender + Age, data=l14)
summary(lmod1)
## 
## Call:
## lm(formula = Points ~ Attitude + gender + Age, data = l14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.4590  -3.3221   0.2186   4.0247  10.4632 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.42910    2.29043   5.863 2.48e-08 ***
## Attitude     0.36066    0.05932   6.080 8.34e-09 ***
## genderM     -0.33054    0.91934  -0.360    0.720    
## Age         -0.07586    0.05367  -1.414    0.159    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared:  0.2018, Adjusted R-squared:  0.187 
## F-statistic: 13.65 on 3 and 162 DF,  p-value: 5.536e-08

According to the model summary, only the attitude variable has a significant (positive) effect on the number of points achieved, with a p value well below the traditional significance threshold of 0.05. Male and older participants may have performed slightly worse, but these correlations are not significant according to the t-statistic applied here, i.e. the null hypotheses that there is no relationship between gender or age and the point score, respectively, cannot be rejected here. Accordingly, gender and age are dropped as explanatory variables from the model from here on.

Note that these judgements on the significance of the explanatory power of the variables are only valid for linear relationships here.

4: Analysing the model

This reduction results in a simple linear model with only one predictor for exam points, namely attitude.

lmod2 <- lm(Points ~ Attitude, data=l14)
summary(lmod2)
## 
## Call:
## lm(formula = Points ~ Attitude, data = l14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The resulting linear model predicts an intercept of 11.6, i.e. a student with a (hypothetical) attitude score of 0 would be predicted to score about 12 points in the exam. The estimate of the \(\beta_1\) parameter of 0.35 means that for each additional score point in attitude, the predicted exam point score increases by 0.35. Note that this very simple model can only predict exam point scores between roughly 16.5 and 29, based on the range of attitude values in the data (14 to 50), with the estimated relationship of

\(points = 11.6 + 0.35 \cdot attitude\)
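The quoted prediction range can be checked directly, assuming the fitted object `lmod2` from above:

```r
# predicted exam points at the extremes of the observed attitude range
predict(lmod2, newdata = data.frame(Attitude = c(14, 50)))
# approximately 16.6 and 29.3 with the coefficients reported above
```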

The ‘multiple’ \(R^2\) of the model is about 0.19, meaning that it is able to explain about 19% of the total variation of the response variable. ‘Multiple’ here refers to the squared multiple correlation between the observed and the fitted values; in the case of this simple model with just one predictor, it reduces to the squared Pearson correlation between points and attitude, so multiple \(R^2\) = \(R^2\).

5: Graphical model validation

In order to validate the applicability of this model, it is important to check to what extent the model assumptions are met, here by graphically analysing the residuals.
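All three checks below can be based on base R’s diagnostic plots for lm objects, for example:

```r
# base R diagnostic plots for the fitted model:
# 1 = residuals vs fitted, 2 = normal QQ-plot, 5 = residuals vs leverage
par(mfrow = c(1, 3))
plot(lmod2, which = c(1, 2, 5))
par(mfrow = c(1, 1))
```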

1. Are the errors normally distributed?

To answer this, let’s have a look at the model’s QQ-plot. If the errors are normally distributed, the standardised residuals and the theoretical quantiles should follow approximately a 1:1 relationship (dashed line in the plot).

Even though there are deviations at the higher and lower ends, I would still say that the normality assumption is met here, as the deviations are rather minor, and the vast majority of the observations follow the 1:1 relationship very neatly. In case I wanted to publish something relying on this assumption, I would still check with somebody with a little more experience in model validation.

2. Do the errors have a constant variance, i.e. does the size of the error not change with the explanatory variable?

This can be checked with a scatterplot of residuals vs predicted values; it should not depict any clear trend or pattern.

The residuals vs predicted values plot for the model here does not show a clear pattern; there is a little reduction in variance at the higher end of the fitted values, but there are also very few data points in this range. I would conclude that the assumption of a constant variance should be warranted here.

3. Do individual observations heavily influence the model parameters / have high leverage?

To check if individual data points have a big influence on the predictions, Cook’s distance is helpful; high values indicate the influence of individual data points.

In this case, the influence of individual data points can be considered very low. Generally, Cook’s distance values of 0.5 or 1 should be investigated (well above the maximum Cook’s distance in this instance).


Exercise 3: Logistic regression: analysis

library(dplyr)
library(ggplot2)

(The data wrangling script can be found here)

1: Creating this file

2: Reading data & short description

metadata source

‘This data approach[es] student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).’

The data has been pre-processed and is used here to analyse the students’ alcohol consumption (one of the ‘social features’, probably). The variable alc_use is the average of workday and weekend consumption (each on a scale from 1 = very low to 5 = very high), and the binary variable $high_use denotes high alcohol usage (alc_use > 2).
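A sketch of how these derived variables were presumably created in the wrangling script, assuming the Dalc/Walc columns in the alc data read below:

```r
# Dalc = workday consumption, Walc = weekend consumption (scale 1-5 each)
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # average consumption
alc$high_use <- alc$alc_use > 2            # TRUE if average use above 2
```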

Below you find an overview of the variables in the data, and a few sample observations. Overall, there are 382 observations/students of 35 variables (for more information on the individual variables, see the metadata link above):

alc <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", header=T)
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
dim(alc)
## [1] 382  35
head(alc)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   nursery internet guardian traveltime studytime failures schoolsup famsup paid
## 1     yes       no   mother          2         2        0       yes     no   no
## 2      no      yes   father          1         2        0        no    yes   no
## 3     yes      yes   mother          1         2        3       yes     no  yes
## 4     yes      yes   mother          1         3        0        no    yes  yes
## 5     yes       no   father          1         2        0        no    yes  yes
## 6     yes      yes   mother          1         2        0        no    yes  yes
##   activities higher romantic famrel freetime goout Dalc Walc health absences G1
## 1         no    yes       no      4        3     4    1    1      3        6  5
## 2         no    yes       no      5        3     3    1    1      3        4  5
## 3         no    yes       no      4        3     2    2    3      3       10  7
## 4        yes    yes      yes      3        2     2    1    1      5        2 15
## 5         no    yes       no      4        3     2    1    2      5        4  6
## 6        yes    yes       no      5        4     2    1    2      5       10 15
##   G2 G3 alc_use high_use
## 1  6  6     1.0    FALSE
## 2  5  6     1.0    FALSE
## 3  8 10     2.5     TRUE
## 4 14 15     1.0    FALSE
## 5 10 10     1.5    FALSE
## 6 15 15     1.5    FALSE

3: 4 variables, 4 hypotheses

H 1. Common assumption that males drink more than females ($sex)
H 2. Learning interferes with drinking (or vice versa; high use positively correlated with $failures)
H 3. Hangovers reduce attendance (high use positively correlated with $absences)
H 4. When in a relationship, heavy drinking is of less interest (high use positively correlated with $romantic == no)

H 1: Common assumption that males drink more than females ($sex)

At first glance, H1 seems to be supported by the data: while there are about four times as many females with low alcohol consumption as with high consumption, there are fewer than twice as many males with low consumption as with high consumption (see table below). The average alcohol use (based on the alc_use variable) is also higher among males.

##    High use
## Sex FALSE TRUE
##   F   157   41
##   M   113   71
## [1] "Average alcohol use: F: 1.61, M: 2.16"
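These figures could be reproduced along these lines, assuming `alc` as read above:

```r
# cross table of sex vs high alcohol use
table(Sex = alc$sex, "High use" = alc$high_use)
# average alcohol use by sex
tapply(alc$alc_use, alc$sex, mean)
```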

H 2. Learning interferes with drinking (or vice versa; high use positively correlated with $failures)

H2 seems to be supported as well; while amongst students who have not failed any courses, about a quarter indicate high alcohol consumption, about half of the students with 2 or 3 failed courses indicate high consumption.

##         High use
## failures FALSE TRUE
##        0   234   82
##        1    22   16
##        2     5    6
##        3     9    8

H 3. Hangovers reduce attendance (high use positively correlated with $absences)

H3 seems to be supported as well; on average, students with high alcohol consumption have about 1.8 more absences (~40%) than those with low consumption.

## [1] "Average absences: low use: 4.23, high use: 6"

The difference also becomes clear when looking at the histograms of absences for high and low alcohol usage separately; in the case of high usage (top), the tail is heavier than in that of lower usage (bottom):
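A sketch of how this comparison and the histograms could be produced (panel arrangement is an assumption):

```r
# mean absences by consumption group
tapply(alc$absences, alc$high_use, mean)

# histograms of absences, one panel per consumption group
library(ggplot2)
ggplot(alc, aes(x = absences)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~high_use, ncol = 1, labeller = label_both)
```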

H 4. When in a relationship, heavy drinking is of less interest (high use positively correlated with $romantic == “no”)

H4 does not seem to be strongly supported by the data, as the proportions of high alcohol usage among those in and not in a relationship do not differ greatly (not in a relationship: ~30% high usage, in a relationship: ~27%).

##                  High use
## In a relationship FALSE TRUE
##               no    182   79
##               yes    88   33

5: Logistic regression model

Below you find the summary for a logistic regression model of high alcohol usage with sex, absences, failures, and being in a romantic relationship as predictors:
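The summary below was presumably produced along these lines (the object name m1 matches the reference in the later section):

```r
# logistic regression of high alcohol use on the four chosen predictors
m1 <- glm(high_use ~ sex + absences + failures + romantic,
          data = alc, family = "binomial")
summary(m1)
```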

## 
## Call:
## glm(formula = high_use ~ sex + absences + failures + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6507  -0.8256  -0.6131   1.0339   2.1292  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8769     0.2381  -7.882 3.23e-15 ***
## sexM          0.9757     0.2451   3.980 6.88e-05 ***
## absences      0.0752     0.0182   4.132 3.59e-05 ***
## failures      0.4142     0.1511   2.741  0.00613 ** 
## romanticyes  -0.2806     0.2655  -1.057  0.29063    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.51  on 377  degrees of freedom
## AIC: 427.51
## 
## Number of Fisher Scoring iterations: 4

As already suspected from the first glance at the data distributions, male sex, absences and failures are positively and significantly correlated with high alcohol consumption, supporting H1, H2 and H3. H4 is not supported by the model, as being in a relationship does not correlate significantly with high consumption.

##                    OR      2.5 %    97.5 %
## (Intercept) 0.1530655 0.09415327 0.2400305
## sexM        2.6531230 1.65102057 4.3248387
## absences    1.0781046 1.04265670 1.1194724
## failures    1.5131929 1.12471665 2.0427535
## romanticyes 0.7553706 0.44426237 1.2613520
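The odds ratios and confidence intervals above could be obtained as follows (a sketch, assuming the fitted model object is named m1, as referred to later in the text):

```r
# odds ratios and their 95% confidence intervals on the OR scale
OR <- exp(coef(m1))
CI <- exp(confint(m1))  # profile likelihood; may require MASS on older R versions
cbind(OR, CI)
```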

When looking at the odds ratios of the model coefficients, the following can be derived:
Sex: being male multiplies the odds of high use by about 2.7 compared to females
Absences: each additional absence multiplies the odds of high use by about 1.08, i.e. increases them by ~8%
Failures: each failed course multiplies the odds of high use by about 1.51, i.e. increases them by ~51%
Romantic: being in a relationship multiplies the odds of high use by about 0.76, i.e. decreases them by ~24% (though this correlation is insignificant)

The confidence intervals are widest for the sex variable, so its effect (in absolute terms) is the most uncertain.

The only predictor with a 95% confidence interval that includes 1 is that of being in a relationship. In practice, this means that it is unclear if the predictor has a positive (odds ratio >1) or negative (odds ratio <1) correlation with the target variable high_use - in other words, the predictor is not significantly correlated with high_use (see model summary).

6: Predictive power of the model

To analyse the predictive power of the model, only the predictor variables with a significant correlation with high alcohol usage were kept, i.e. high_use ~ sex + absences + failures (the instructions about dropping insignificant predictors were a bit unclear to me). Below you find a summary of the model (largely equivalent to m1):

## 
## Call:
## glm(formula = high_use ~ sex + absences + failures, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6629  -0.8545  -0.5894   1.0339   2.0428  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.95397    0.22819  -8.563  < 2e-16 ***
## sexM         0.98848    0.24453   4.042 5.29e-05 ***
## absences     0.07294    0.01796   4.061 4.88e-05 ***
## failures     0.40462    0.15024   2.693  0.00708 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 418.64  on 378  degrees of freedom
## AIC: 426.64
## 
## Number of Fisher Scoring iterations: 4
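A sketch of the fitting and prediction steps behind the tables that follow; hu_prob matches the name used by the loss function further below, while hu_pred is a hypothetical name for the binary predictions:

```r
# refit without the insignificant relationship predictor
m2 <- glm(high_use ~ sex + absences + failures,
          data = alc, family = "binomial")

# predicted probabilities and binary predictions (0.5 threshold)
alc$hu_prob <- predict(m2, type = "response")
alc$hu_pred <- alc$hu_prob > 0.5

# cross tables of predictions vs observations: counts and proportions
table(prediction = alc$hu_pred, observation = alc$high_use) %>% addmargins
table(prediction = alc$hu_pred, observation = alc$high_use) %>%
  prop.table %>% addmargins
```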

This model was then used to predict high_use (input: training data/the whole data set); here are two cross tables of predicted and observed high_use (top: counts, bottom: proportions):

##           observation
## prediction FALSE TRUE Sum
##      FALSE   258   86 344
##      TRUE     12   26  38
##      Sum     270  112 382
##           observation
## prediction      FALSE       TRUE        Sum
##      FALSE 0.67539267 0.22513089 0.90052356
##      TRUE  0.03141361 0.06806283 0.09947644
##      Sum   0.70680628 0.29319372 1.00000000

Here, it becomes clear that the model is rather conservative in predicting high alcohol usage, predicting high use for only 10% of the students (data: 29%), and correspondingly producing relatively more false negatives (23%) than false positives (3%). The training error can be calculated as

training error = false positives (0.031) + false negatives (0.225) = 0.256

… this is confirmed by the loss function provided in the course:

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

loss_func(class = alc$high_use, prob = alc$hu_prob)
## [1] 0.2565445

The higher prevalence of false negatives also becomes clear from a scatterplot (false negatives: red points on the upper line; false positives: cyan points on the lower line):

A simple guessing scheme to ‘predict’ high alcohol use can be formulated as

set.seed(1)
alc$hu_guess <- sample(x=c(rep(FALSE, 270), rep(TRUE, 112)), size=nrow(alc), replace=T)

i.e. the probability of assigning high usage to a student is based on the prevalence of high usage in the data. This simple guessing strategy’s ‘predictive power’ can be assessed in the same way as the above model:

table(guess = alc$hu_guess, observation = alc$high_use) %>% prop.table %>% addmargins
##        observation
## guess       FALSE      TRUE       Sum
##   FALSE 0.4816754 0.1858639 0.6675393
##   TRUE  0.2251309 0.1073298 0.3324607
##   Sum   0.7068063 0.2931937 1.0000000
loss_func(class = alc$high_use, prob = alc$hu_guess)
## [1] 0.4109948

training error = false positives 0.225 + false negatives 0.186 = 0.411

The overall predictive power of the model seems therefore superior to bare guessing.

An interesting closing remark: Note that guessing in fact predicts high usage correctly more often (~11%) than the model (~7%) (!).


Exercise 4: LDA

library(MASS)
library(dplyr)
library(ggplot2)
library(corrplot)
library(knitr)

(The data wrangling script for next week can be found here)

1: Creating this file

2: Reading data & short description

metadata source 1

metadata source 2

The Boston dataset consists of census data on various factors, mostly pertaining to social and infrastructural conditions, but also including the nitric oxide concentration, for suburbs of Boston. Below you find an overview of the variables in the data, and a few sample observations. Overall, there are 506 observations (these presumably correspond to census tracts or similar administrative segments) of 14 variables (for more information on the individual variables, see the metadata links above):

data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7

3: (Graphical) overview over data

Below you find summaries of the variables as well as pairwise scatterplots for all variable combinations; I won’t go through all of them, but pick a few interesting ones.

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

The crime rate seems to grow somewhat exponentially with the variable age, i.e. with the proportion of pre-WW2 housing:

plot(crim~age, data=Boston)

Apparently, substantial criminal activity only happens where the property tax rate is ~6.7% :-)
[Downtown areas with common tax rate?]

plot(crim~tax, data=Boston)

The nitric oxide concentration decreases with increasing distance to employment centres:

plot(nox~dis, data=Boston)

And, not surprisingly, the median value of housing (medv) seems to increase with the mean number of rooms (rm).

plot(medv~rm, data=Boston)

4: Standardisation, categorisation and training/testing subsets

As LDA assumes that the variables are normally distributed and share the same (co)variance across classes, standardising/scaling the data is often necessary, here with

\(scaled(x) = \displaystyle \frac{x-mean(x)}{sd(x)}\)
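The base R `scale` function performs exactly this column-wise standardisation, which can be checked on a single variable:

```r
library(MASS)  # provides the Boston data
data("Boston")

# scale() centres by the mean and divides by the standard deviation
x <- Boston$nox
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```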

As you can see from the summary below, the variables are now centred around 0 and on the same scale, so differences in their variation/distribution are directly comparable.

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

For example, the nitric oxide concentration ranged between 0.38 and 0.87 (parts per 10 million), while the property tax rate ranged between 187 and 711 (USD per 10,000 USD), making direct comparisons of the distributions difficult; after scaling, the values range between -1.5 and 2.7, and -1.3 and 1.8, respectively:

summary(Boston[c(5, 10)])
##       nox              tax       
##  Min.   :0.3850   Min.   :187.0  
##  1st Qu.:0.4490   1st Qu.:279.0  
##  Median :0.5380   Median :330.0  
##  Mean   :0.5547   Mean   :408.2  
##  3rd Qu.:0.6240   3rd Qu.:666.0  
##  Max.   :0.8710   Max.   :711.0
summary(boston_scaled[c(5, 10)])
##       nox               tax         
##  Min.   :-1.4644   Min.   :-1.3127  
##  1st Qu.:-0.9121   1st Qu.:-0.7668  
##  Median :-0.1441   Median :-0.4642  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 1.5294  
##  Max.   : 2.7296   Max.   : 1.7964
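
As a quick sanity check (not part of the original analysis), scale() should reproduce (x - mean(x)) / sd(x) column by column:

```r
# scale() standardises column-wise: subtract the mean, divide by the sd
x <- c(2, 4, 6, 8)
s <- as.numeric(scale(x))
round(s, 3)
# the manual computation gives the same result
all.equal(s, (x - mean(x)) / sd(x))  # TRUE
```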

Next, crim is converted to a categorical variable with four levels corresponding to its quartiles:

# extract quantiles
quantiles <- quantile(boston_scaled$crim)
# categorise crim variable according to quantiles, add to boston_scaled df
boston_scaled$crime <- cut(boston_scaled$crim, breaks = quantiles, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove continuous crim variable
boston_scaled <- subset(boston_scaled, select = -crim)
# randomly assign 80% of observations to the training set
set.seed(42)
trainid <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[trainid, ]
test  <- boston_scaled[-trainid, ]
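
As a standalone sanity check of the wrangling above (a sketch that re-runs the steps from scratch; variable names mirror the analysis):

```r
library(MASS)

# re-run the wrangling steps from scratch and check the results
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
quantiles <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = quantiles,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- subset(boston_scaled, select = -crim)

set.seed(42)
trainid <- sample(nrow(boston_scaled), size = nrow(boston_scaled) * 0.8)
train <- boston_scaled[trainid, ]
test  <- boston_scaled[-trainid, ]

# quartile-based cut -> four classes of (nearly) equal size
table(boston_scaled$crime)
# split proportion should be close to 80/20
nrow(train) / nrow(boston_scaled)
```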

5: LDA

An LDA is fitted to the train dataset with crime as the target variable and all others as predictors:

lda.fit <- lda(crime ~ ., data = train)

# function for lda biplot arrows (copied from DataCamp)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 1.5)

To my understanding, the biplot shows that LD1 is able to separate a group of high and a few med_high crime observations from the rest pretty well (cluster on the right), but has difficulties within the left cluster: there seems to be a trend towards lower crime rate categories at lower LD1 scores, but in that dimension the cluster is mainly an overlap of categories. LD2 further distinguishes the left cluster (low in the upper, med_high in the lower ‘subcluster’). However, the left-central cluster still includes a lot of low, med_low, and med_high crime rates, so these two LDs leave a large proportion of the observations ‘unseparated’ (lower three categories only). I hope next week’s exercise will clarify the biplot concept a bit more for me. The arrows indicate that access to radial highways ($rad) is the best discriminating variable, followed by $zn (proportion of residential land zoned for large lots) and nitric oxide concentration ($nox).

The following two plots make the arrows easier to read (the arrow lengths are rescaled to ‘zoom in’; some extend outside the plot):

plot(lda.fit, dimen = 2, col="lightgrey", pch=classes)
lda.arrows(lda.fit, myscale = 2)

plot(lda.fit, dimen = 2, col="lightgrey", pch=classes)
lda.arrows(lda.fit, myscale = 6)

6: Predictions with LDA

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- subset(test, select=(-crime))

# predict
lda.pred <- predict(lda.fit, newdata = test)

# results
restab<- table(correct = correct_classes, predicted = lda.pred$class)
restab
##           predicted
## correct    low med_low med_high high
##   low       23      10        1    0
##   med_low    3      14        4    0
##   med_high   0      11       12    1
##   high       0       0        1   22

The model seems to predict the observations with high crime rate class pretty well, with just one test observation falsely predicted to be med_high instead of high, and one falsely predicted as high instead of med_high. However, the lower crime classes are not predicted as well; especially med_low was predicted for almost equal shares of low, med_low, and med_high classes in the test data. So, as far as I can see, the model is well-suited to predict areas of high crime rates, but cannot distinguish between lower rates particularly well.
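
To put numbers on this impression, overall accuracy and per-class recall can be computed from the confusion matrix (a sketch; the counts are copied by hand from the table above):

```r
# confusion matrix re-entered from the printed table above
restab <- matrix(c(23, 10,  1,  0,
                    3, 14,  4,  0,
                    0, 11, 12,  1,
                    0,  0,  1, 22),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                                 predicted = c("low", "med_low", "med_high", "high")))
# overall accuracy: correctly classified / all test observations
accuracy <- sum(diag(restab)) / sum(restab)
# per-class recall: correctly classified share within each true class
recall <- diag(restab) / rowSums(restab)
round(accuracy, 3)  # 0.696
round(recall, 2)
```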

7: Distances and k-means

First, the dataset is read in again and the variables are scaled (again to improve comparability, this time of the distances):

# re-read dataset, scale, convert to df
data('Boston')
b_scld <- as.data.frame(scale(Boston))

Euclidean and Manhattan distances:

dist_eu <- dist(b_scld)
dist_man <- dist(b_scld, method = "manhattan")

rbind(euclidean = summary(dist_eu), manhattan = summary(dist_man))
##                Min.  1st Qu.    Median      Mean   3rd Qu.     Max.
## euclidean 0.1343256 3.462494  4.824125  4.911095  6.186333 14.39703
## manhattan 0.2661623 8.483217 12.609003 13.548796 17.756834 48.86182

The main difference between the Euclidean and Manhattan distances here is one of scale, i.e. the Manhattan distances are generally 2-3.5 times larger. In addition, the minimum and maximum lie relatively further from the centres (mean/median) for the Manhattan metric. If these distances are used for optimisation, I would guess that the Manhattan distance gives more weight to extreme values/outliers.
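
The scale difference follows directly from the definitions; a minimal two-point example:

```r
# on a single pair of points the two metrics are easy to compare
p <- rbind(c(0, 0), c(3, 4))
d_eu  <- dist(p)                       # sqrt(3^2 + 4^2) = 5
d_man <- dist(p, method = "manhattan") # |3| + |4|       = 7
c(euclidean = as.numeric(d_eu), manhattan = as.numeric(d_man))
```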

k-means clustering

k-means clustering with 4 centres is applied to the Boston dataset (note that kmeans is run on the unscaled data here, so large-range variables such as tax dominate the distance calculations):

# k-means clustering with 4 centres
km <- kmeans(Boston, centers = 4)

kable(data.frame(cluster = c(1, 2, 3, 4), crim = round(c(0.2537560, 12.2991617, 0.9497621, 0.1184122), 2)), format = "html", table.attr = "style='width:20%;'")

 cluster   crim
       1   0.25
       2  12.30
       3   0.95
       4   0.12

For some reason, printing the summary of kmeans objects did not work in this R Markdown setup, so the above table was put together manually. It shows the cluster means of the crim variable for the four-cluster kmeans solution. This method, too, is able to separate one cluster (2) with high crime rates, while the means of the other clusters lie very close to each other, especially compared to the magnitude of the cluster 2 mean.

To determine the optimal number of clusters for kmeans, the clustering is run for 1 to k_max clusters (k_max being a deliberate overestimate), and the k after which the total within-cluster sum of squares (WCSS) stops dropping sharply is chosen (‘elbow method’). Here, the optimal number of clusters is 2, as the plot of total WCSS below shows:

set.seed(42)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})

# plot 
qplot(x = 1:k_max, y = twcss, geom = 'line')
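
To make ‘drops sharply’ concrete, the relative drop of total WCSS from k to k+1 can be computed directly (a sketch; nstart is added here to stabilise the random initialisation, so the numbers may differ slightly from the plot above):

```r
library(MASS)

data("Boston")
set.seed(42)
k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(Boston, k, nstart = 10)$tot.withinss)

# relative drop in total WCSS when going from k to k+1 clusters;
# the 'elbow' is where this drop collapses
rel_drop <- -diff(twcss) / head(twcss, -1)
round(rel_drop, 2)
```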

Running kmeans again with k=2, pairwise scatterplots of all variable combinations (black = cluster 1, red = cluster 2):

# k-means clustering
km <-kmeans(Boston, centers = 2)

# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)

Again, I won’t go through all of the variable combinations, but look at the ones I glanced at in the very beginning, and at the one that seemed to influence the LDA clustering most.

While the age of housing seems to be largely ignored by the k-means clustering, crime rates have apparently affected the clustering heavily; no observations with a crime rate higher than 4 were allocated to the first cluster (referred to as the low-crime cluster from here on), and very few observations in the second cluster have low crime rates:

plot(crim~age, data=Boston, col = km$cluster)

max(Boston$crim[km$cluster==1])
## [1] 4.0974

Apparently, I am not alone with the interpretation that substantial criminal activity only happens where the property tax rate is ~6.7%:

plot(crim~tax, data=Boston, col = km$cluster)

Both nitric oxide concentration and distance to employment centres show a somewhat overlapping ‘edge’ between the clusters, with the separation at one end of the range (high-crime cluster at high nox concentrations and low distances to employment centres). Note, however, the low-crime cluster observations at the highest nox concentrations.

plot(nox~dis, data=Boston, col = km$cluster)

Very few observations with a high value of housing are allocated to the high-crime cluster; room count does not seem to have influenced the clustering significantly:

plot(medv~rm, data=Boston, col = km$cluster)

Access to radial highways (the variable with the highest impact in the LDA above) is also very clearly separated by the two-cluster kmeans solution:

plot(rad~crim, data=Boston, col = km$cluster)

All in all, the two-cluster k-means solution seems to work very well for some variables in the data, as did the LDA in predicting the highest crime rate observations. One should have a good look at the correlations between the variables before drawing conclusions, though.
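
As a starting point for such a check, the strongest pairwise correlations can be listed programmatically (a sketch; the 0.7 threshold is an arbitrary choice):

```r
library(MASS)

data("Boston")
cm <- cor(Boston)
# keep only the lower triangle so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA
# variable pairs with |r| > 0.7
top <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[top[, "row"]],
           var2 = colnames(cm)[top[, "col"]],
           r    = round(cm[top], 2))
```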


Exercise 5: Dimensionality reduction

library(MASS)
library(tidyverse)
library(ggplot2)
library(corrplot)
library(knitr)
library(GGally)
library(FactoMineR)

(The data wrangling script can be found here: part 1, part 2)

1: Data overview

The dataset consists of 155 observations (countries) of 8 variables (all numerical) on social, health, and gender equality indicators (for further information, see the data wrangling scripts and metadata):

human <- read.csv("data/human2.csv", row.names = 1)
dim(human)
## [1] 155   8
summary(human)
##  secedu_f.ratio      edu_exp         life_exp      lab_f.ratio    
##  Min.   :0.1717   Min.   : 5.40   Min.   :49.00   Min.   :0.1857  
##  1st Qu.:0.7264   1st Qu.:11.25   1st Qu.:66.30   1st Qu.:0.5984  
##  Median :0.9375   Median :13.50   Median :74.20   Median :0.7535  
##  Mean   :0.8529   Mean   :13.18   Mean   :71.65   Mean   :0.7074  
##  3rd Qu.:0.9969   3rd Qu.:15.20   3rd Qu.:77.25   3rd Qu.:0.8536  
##  Max.   :1.4967   Max.   :20.20   Max.   :83.50   Max.   :1.0380  
##     mat_mort        gni_cpta.n       adol_birth        rep_parl    
##  Min.   :   1.0   Min.   :   581   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  11.5   1st Qu.:  4198   1st Qu.: 12.65   1st Qu.:12.40  
##  Median :  49.0   Median : 12040   Median : 33.60   Median :19.30  
##  Mean   : 149.1   Mean   : 17628   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 190.0   3rd Qu.: 24512   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :1100.0   Max.   :123124   Max.   :204.80   Max.   :57.50
head(human)
##             secedu_f.ratio edu_exp life_exp lab_f.ratio mat_mort gni_cpta.n
## Afghanistan         0.1980     9.3     60.4      0.1987      400       1885
## Albania             0.9306    11.8     77.8      0.6855       21       9943
## Algeria             0.8613    14.0     74.8      0.2105       89      13054
## Argentina           0.9774    17.9     76.3      0.6333       69      22050
## Armenia             0.9895    12.3     74.7      0.7466       29       8124
## Australia           0.9968    20.2     82.4      0.8189        6      42261
##             adol_birth rep_parl
## Afghanistan       86.8     27.6
## Albania           15.3     20.7
## Algeria           10.0     25.7
## Argentina         54.4     36.8
## Armenia           27.1     10.7
## Australia         12.1     30.5
pairs(human)

corrplot(cor(human), type="upper")

Again, I won’t go through all variable combinations, but comment on a few interesting ones:

  • high maternal mortality seems to be limited to countries with low GNI
  • both adolescent births and maternal mortality are negatively correlated with the ratio of females in secondary education, expected time of education, and life expectancy
  • the latter three variables are positively correlated with wealth (GNI)
  • unsurprisingly, there is a high positive correlation between adolescent births and maternal mortality

2: PCA, raw data

Here’s a PCA of the (unscaled) dataset:

pca_human <- prcomp(human)
psum <- summary(pca_human)
# Variability captured by PC (rounded to full %):
round(100*psum$importance[2, ], digits = 1)
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
biplot(pca_human, choices = 1:2, col=c("steelblue", "darkred"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

This PCA yields essentially a single principal component that explains all of the variance the PCA captures, and it coincides with GNI. This is not very surprising, as GNI is the variable with by far the highest variance in absolute terms, ranging from about 580 to 123,000 (second biggest range: 1 to 1,100; see the summary of the data above). As the PCA maximises variance on these absolute values, the role of GNI is exaggerated compared to the other variables. All other PCs explain only marginal variance (see the variability table above), and the other variables are effectively ignored in the PCA; this is also reflected in the warning messages R prints when plotting the variable arrows: they are too short to even have a direction (length 0).
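
The effect can be reproduced on synthetic data: a sketch (not part of the original analysis) where a single high-variance column hijacks PC1 of an unscaled PCA:

```r
# one synthetic high-variance column dominates PC1 when the data is not scaled
set.seed(1)
d <- data.frame(a   = rnorm(100),             # sd ~ 1
                b   = rnorm(100),             # sd ~ 1
                big = rnorm(100, sd = 1000))  # dominates the total variance
p_raw    <- prcomp(d)
p_scaled <- prcomp(d, scale. = TRUE)
round(summary(p_raw)$importance[2, ], 3)     # PC1 close to 1, the rest ~ 0
round(summary(p_scaled)$importance[2, ], 3)  # variance spread over the PCs
```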

3+4: PCA, scaled data

So, as mentioned in the DataCamp exercise, it is probably a good idea to standardise the data and run the PCA again:

pca_human2 <- prcomp(scale(human))
summary(scale(human))
##  secedu_f.ratio       edu_exp           life_exp        lab_f.ratio     
##  Min.   :-2.8189   Min.   :-2.7378   Min.   :-2.7188   Min.   :-2.6246  
##  1st Qu.:-0.5234   1st Qu.:-0.6782   1st Qu.:-0.6425   1st Qu.:-0.5485  
##  Median : 0.3503   Median : 0.1140   Median : 0.3056   Median : 0.2318  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5959   3rd Qu.: 0.7126   3rd Qu.: 0.6717   3rd Qu.: 0.7351  
##  Max.   : 2.6645   Max.   : 2.4730   Max.   : 1.4218   Max.   : 1.6630  
##     mat_mort         gni_cpta.n        adol_birth         rep_parl      
##  Min.   :-0.6992   Min.   :-0.9193   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.6496   1st Qu.:-0.7243   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.4726   Median :-0.3013   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.1932   3rd Qu.: 0.3712   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 4.4899   Max.   : 5.6890   Max.   : 3.8344   Max.   : 3.1850
summary(pca_human2)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66195 0.53630 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96068 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
sum <- summary(pca_human2)
# Variability captured by PC (rounded to full %):
round(100*sum$importance[2, ], digits = 1)
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
biplot(pca_human2, choices = 1:2, col=c("steelblue", "darkred"), main = "Biplot: Gender inequality: PCA using standardised values")
Biplot of a PCA using standardised values of gender inequality factors. Contrary to the non-standardised PCA, variables other than GNI contribute to the principal components.

Standardising the data (all variables now on comparable orders of magnitude, see the summary above) makes the picture much more detailed: more than one PC now explains a substantial proportion of the variance (see the importance table above). For PC1, the decisive variables fall into two diametrically opposed groups: maternal mortality and adolescent births versus expected time of education, life expectancy, GNI, and the ratio of females with secondary education. For PC2, the decisive variables are the ratio of females in the labour force and their representation in parliament.
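
The ‘decisive variables’ can be read directly from the rotation (loading) matrix returned by prcomp. Since the human2.csv file is not bundled with this sketch, the built-in USArrests data is used instead:

```r
# loadings of the first two PCs; large absolute values mark the
# 'decisive' variables for each component
pca <- prcomp(USArrests, scale. = TRUE)
round(pca$rotation[, 1:2], 2)
# variables ranked by their contribution to PC1
sort(abs(pca$rotation[, 1]), decreasing = TRUE)
```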

5: MCA: tea

data(tea)
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
gather(tea) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

The tea dataset is based on a survey of tea drinking habits and product perception, comprising 36 questions (including some background information) and 300 respondents/observations. The answers are (frankly, horribly) enshrined in the factor levels (all variables except respondent age are factors). The questions themselves are not specified in the package metadata and have to be inferred from the variable names, leaving ample room for (mis)interpretation.

As I have enough trouble interpreting the MCA outputs as it is, I’ll reduce the number of variables considerably for the MCA.

keep_columns <- c("Tea", "sugar", "pub", "sex", "feminine")
tea2 <- dplyr::select(tea, one_of(keep_columns))

t.mca <- MCA(tea2, graph=F)
plot.MCA(t.mca, invisible=c("ind"), select="1")

From the MCA variable biplot/factor map, I would conclude that Earl Grey is more often drunk with sugar than are green and other black teas, as the corresponding locations in the biplot are closer to each other. Males tend to see tea as not feminine, while females see it as feminine more often (again, shorter distances); in addition, females seem to drink tea in pubs more often than males do. Individuals drinking black and green tea are very similar in other aspects, but rather different from those inclined to Earl Grey.

Dimension 1 is mostly influenced by sex (and feminine, whatever that exactly means), while dimension 2 is linked to the sugar/Earl Grey vs black & green / no sugar dichotomy. The difference between pub / no pub is equally depicted in both dimensions.
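
For intuition: MCA is essentially correspondence analysis of the one-hot indicator matrix of the factors (FactoMineR applies chi-square weighting internally, which this sketch omits). A base-R sketch of that indicator matrix, using a toy data frame rather than the tea data:

```r
# toy data frame standing in for (a subset of) the tea data
toy <- data.frame(Tea   = factor(c("black", "green", "Earl Grey", "green")),
                  sugar = factor(c("sugar", "No.sugar", "sugar", "No.sugar")))
# full one-hot encoding: one indicator column per factor level
ind <- model.matrix(~ . - 1, data = toy,
                    contrasts.arg = lapply(toy, contrasts, contrasts = FALSE))
ind  # 4 rows x 5 columns; every row sums to the number of factors (2)
```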


(more chapters to be added similarly as we proceed with the course!)